Part I - Ford GoBike Dataset Exploration

by Darragh Merrick

fordGoBike.jpg

Introduction

Dataset Overview and Notes

This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.

Note that this dataset will require some data wrangling in order to make it tidy for analysis. There are multiple cities covered by the linked system, and multiple data files will need to be joined together if a full year’s coverage is desired. If you’re feeling adventurous, try adding in analysis from other cities, following links from this page.

Example Topics/Questions

When are most trips taken in terms of time of day, day of the week, or month of the year?

How long does the average trip take?

Does the above depend on if a user is a subscriber or customer?

Rubric Tip: Your code should not generate any errors, and should use functions, loops where possible to reduce repetitive code. Prefer to use functions to reuse code statements.

Rubric Tip: Document your approach and findings in markdown cells. Use comments and docstrings in code cells to document the code functionality.

Rubric Tip: Markup cells should have headers and text that organize your thoughts, findings, and what you plan on investigating next.

Preliminary Wrangling

Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.

Data Cleaning

Missing data

Visualise missing data in the dataset
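As a minimal, dependency-free sketch of this step, the snippet below counts missing values per column and prints a crude text bar chart. The sample rows, the column names and the `count_missing` helper are made up for illustration and are not taken from the real file.

```python
import csv
import io

# Made-up sample rows; the column names are assumptions for illustration only.
SAMPLE_CSV = """duration_sec,start_station_id,member_birth_year,member_gender
523,21,1984,Male
1790,,1990,
312,58,,Female
"""


def count_missing(rows, missing_tokens=("", "NA", "NaN", "nan")):
    """Return {column: number of rows in which the value is missing}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or value.strip() in missing_tokens
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = count_missing(rows)

# Crude text "bar chart": one '#' per missing value in each column.
for column, n in missing.items():
    print(f"{column:<20} {'#' * n} ({n}/{len(rows)})")
```

The same per-column counts could feed a proper bar chart or heatmap in whichever plotting library the notebook uses.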

There are quality and tidiness issues in the data that will need to be addressed. The data types of multiple columns will need to be changed to gain insights such as:

There are missing values in:

There are also invalid birth year values.
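One possible way to flag invalid birth years is to check the age they imply. This is only a sketch: the helper name, the age limits and the ride year of 2019 are assumptions chosen for illustration, and the sample values are made up.

```python
def is_plausible_birth_year(birth_year, ride_year, min_age=0, max_age=100):
    """Return True if the age implied by birth_year lies in [min_age, max_age].

    The age limits are illustrative assumptions, not rules from the dataset.
    """
    if birth_year is None:
        return False
    age = ride_year - birth_year
    return min_age <= age <= max_age


# Made-up values, checked against a hypothetical ride year of 2019.
birth_years = [1878, 1975, 1992, 2001, None]
flags = [is_plausible_birth_year(year, ride_year=2019) for year in birth_years]

for year, ok in zip(birth_years, flags):
    print(year, "plausible" if ok else "flag for review")
```

Flagging rather than deleting keeps the decision about what to do with these rows separate from the detection step.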

Other elements could be added, such as the distance between the start and end stations, which would provide useful information.

I would like to visualize the start and end locations on a map, which could also add useful information to this data study, even though it's not covered in this course.

What is the structure of your dataset?

The dataset has 183412 rows and 16 columns.

Exploratory Data Visualizations

Formula to calculate distance between 2 locations

formula.png

distance.png

Tested results against online calculator https://www.calculator.net/distance-calculator.html
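The formula itself is only available as an image, so the sketch below assumes it is the haversine great-circle formula with a mean Earth radius of 6371 km. The function name `haversine_km` is hypothetical. With the San Francisco and San Jose coordinates discussed in this section, this assumption gives a result close to the 69.47 km reported.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Start and end coordinates of the longest trip in this section.
distance = haversine_km(37.7896254, -122.400811, 37.3172979, -121.884995)
print(f"{distance:.2f} km")
```

Note that this is a straight-line distance over the Earth's surface, not the length of the route actually cycled.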

Looking at the one trip with a distance of 69.47 km:

| start lat, long | end lat, long | duration | distance |
|---|---|---|---|
| 37.7896254, -122.400811 | 37.3172979, -121.884995 | 6945 seconds (1.93 hours) | 69.47 km |

An estimated average of 5 km every 10 minutes gives 30 km/h, or about 58 km over 1.93 hours. Covering 69.47 km in that time implies an average of about 36 km/h, which is fast but not impossible.
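The plausibility check can be reproduced directly from the duration and distance of this trip. The variable names below are illustrative only.

```python
# Values taken from the trip described in this section.
distance_km = 69.47
duration_sec = 6945

duration_h = duration_sec / 3600           # seconds to hours
avg_speed_kmh = distance_km / duration_h   # speed implied by the trip

# The estimated pace of 5 km every 10 minutes, expressed in km/h.
assumed_pace_kmh = 5 / (10 / 60)
distance_at_assumed_pace = assumed_pace_kmh * duration_h

print(f"duration: {duration_h:.2f} h")
print(f"implied average speed: {avg_speed_kmh:.1f} km/h")
print(f"distance at assumed pace: {distance_at_assumed_pace:.1f} km")
```

The implied speed is somewhat above the assumed pace, which is why the trip is treated as unusual but still feasible.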

From the map below, it looks like someone cycled from San Francisco to San Jose. This would not be impossible, so I am not going to filter it out as an outlier.

Univariate Exploration

In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.

Rubric Tip: The project (Part I alone) should have at least 15 visualizations distributed over univariate, bivariate, and multivariate plots to explore many relationships in the data set. Use reasoning to justify the flow of the exploration.

Rubric Tip: Use the "Question-Visualization-Observations" framework throughout the exploration. This framework involves asking a question from the data, creating a visualization to find answers, and then recording observations after each visualisation.

What is/are the main feature(s) of interest in your dataset?

The trip duration and the latitude and longitude of the start and end stations could generate interesting results. The most popular start and end stations could show interesting trends. Start and end times include the year, month and day, so we can find trends in popular times, days, months and seasons. Statistics about gender and age may also show which groups cycle the most.
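As a rough sketch of how popular times could be derived, the snippet below splits timestamp strings into hour, weekday and month and counts trips for each. The timestamp format, the sample values and the `time_features` helper are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def time_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into (hour, weekday name, month number)."""
    moment = datetime.strptime(timestamp, fmt)
    return moment.hour, WEEKDAYS[moment.weekday()], moment.month


# Made-up start times in an assumed "YYYY-MM-DD HH:MM:SS" format.
start_times = ["2019-02-28 17:32:10", "2019-02-28 08:05:00", "2019-03-02 17:45:30"]

features = [time_features(t) for t in start_times]
trips_per_hour = Counter(hour for hour, _, _ in features)
trips_per_weekday = Counter(day for _, day, _ in features)
trips_per_month = Counter(month for _, _, month in features)

print(trips_per_hour.most_common(1))  # busiest hour in the sample
```

Each counter maps directly onto a bar chart with the time unit on the x-axis and the trip count on the y-axis.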

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Gender, age and user type will help profile the customers. start_station_id and end_station_id will help find the most popular routes, to assist bike redistribution: bikes will need to be moved from the most popular destinations to the most popular starting points to keep bikes available there. start_time and end_time will help investigate cycle durations and peak times.
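The redistribution idea can be sketched by counting routes and computing a net flow per station (arrivals minus departures). The station ids and trips below are made up, and `net_flow` is a hypothetical helper.

```python
from collections import Counter

# Made-up (start_station_id, end_station_id) pairs.
trips = [(21, 58), (21, 58), (21, 3), (58, 21), (3, 58)]

route_counts = Counter(trips)                        # trips per route
departures = Counter(start for start, _ in trips)    # trips leaving a station
arrivals = Counter(end for _, end in trips)          # trips ending at a station


def net_flow(departures, arrivals):
    """Return {station: arrivals - departures}.

    A positive value means bikes accumulate at the station;
    a negative value means the station tends to run out of bikes.
    """
    stations = set(departures) | set(arrivals)
    return {s: arrivals[s] - departures[s] for s in stations}


flow = net_flow(departures, arrivals)
print(route_counts.most_common(1))  # most popular route in the sample
print(flow)
```

Stations with a strongly negative net flow would be candidates to receive bikes from stations with a strongly positive one.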

Rubric Tip: Visualizations should depict the data appropriately so that the plots are easily interpretable. You should choose an appropriate plot type, data encodings, and formatting as needed. The formatting may include setting/adding the title, labels, legend, and comments. Also, do not overplot or incorrectly plot ordinal data.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

In the duration_sec graph, I used a log scale to spread the values out into a more even, symmetric-looking distribution. The age and distance distributions are positively skewed.

A distribution is said to be skewed to the right if it has a long tail that trails toward the right side. The skewness value of a positively skewed distribution is greater than zero.

The count of birth years is a negatively skewed distribution.
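The sign convention can be checked numerically. The sketch below uses the moment-based skewness coefficient (third central moment divided by the variance to the power 1.5) on made-up samples. The reference year of 2019 used to turn birth years into ages is an assumption.

```python
import math


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up samples: durations with a long right tail,
# birth years with a long left tail.
durations = [100, 200, 300, 400, 500, 600, 10000]
birth_years = [1950, 1985, 1988, 1990, 1992, 1993, 1995]

log_durations = [math.log10(d) for d in durations]
ages = [2019 - year for year in birth_years]  # 2019 is an assumed reference year

print("durations:", skewness(durations))          # positive: right tail
print("log10(durations):", skewness(log_durations))
print("birth years:", skewness(birth_years))      # negative: left tail
print("ages:", skewness(ages))                    # mirror image of birth years
```

Because age is a fixed year minus the birth year, the two distributions are mirror images, which is consistent with ages being positively skewed while birth years are negatively skewed. The log transform reduces the skew of the duration sample without necessarily removing it.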

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I didn't find any of the results to be unusual. The most common age group was 25 to 40. The most common trips were between 0.5 and 2.5 km, although there was one trip from San Francisco to San Jose of 69.47 km. I did find it surprising that over 70% of users were male; I would have expected more even usage across genders.

Bivariate Exploration

In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Multivariate Exploration

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?


Conclusions

During this project I cleaned the data and used folium to plot the latitude and longitude of the start and end locations to visualise the journeys. I analysed the data by plotting exploratory graphs, then used univariate, bivariate and multivariate graphs to further explore the relationships and trends of duration and distance cycled by age, gender and user type. I felt there were no major surprises in the findings, other than that males were the majority of users.

References